Automatic Action Annotation in Weakly Labeled Videos
Manual spatio-temporal annotation of human action in videos is laborious,
requires several annotators and contains human biases. In this paper, we
present a weakly supervised approach to automatically obtain spatio-temporal
annotations of an actor in action videos. We first obtain a large number of
action proposals in each video. To capture the few most representative action
proposals in each video and avoid processing thousands of them, we rank them
using optical flow and saliency in a 3D-MRF based framework and select a few
proposals using a MAP-based proposal subset selection method. We demonstrate that
this ranking preserves the high-quality action proposals. Several such
proposals are generated for each video of the same action. Our next challenge
is to iteratively select one proposal from each video so that all proposals are
globally consistent. We formulate this as a Generalized Maximum Clique Graph
problem using shape, global and fine-grained similarity of proposals across the
videos. The output of our method consists of the most action-representative
proposals from each video. Our method can also annotate multiple instances of
the same action in a video. We have validated our approach on three challenging
action datasets: UCF Sports, sub-JHMDB and THUMOS'13, and have obtained promising
results compared to several baseline methods. Moreover, on UCF Sports, we
demonstrate that action classifiers trained on these automatically obtained
spatio-temporal annotations have comparable performance to the classifiers
trained on ground truth annotation.
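
The cross-video selection step described above (one proposal per video, chosen so that all picks agree with each other) can be illustrated with a small sketch. This is not the authors' solver: it enumerates every combination exhaustively instead of selecting iteratively, and it replaces the shape, global and fine-grained similarities with a single hypothetical measure (negative squared Euclidean distance between made-up feature vectors). The function name and the data layout are assumptions for illustration only.

```python
from itertools import product


def select_consistent_proposals(features):
    """Pick one proposal per video so that the picks are mutually similar.

    features[v][p] is a feature vector (list of floats) for proposal p of
    video v. Returns a tuple holding the chosen proposal index per video.
    Exhaustive search: only practical for a handful of videos/proposals.
    """
    def similarity(a, b):
        # Hypothetical stand-in similarity: higher means more alike.
        return -sum((x - y) ** 2 for x, y in zip(a, b))

    best_choice, best_score = None, float("-inf")
    # One index per video: every combination is a candidate "clique".
    for choice in product(*(range(len(video)) for video in features)):
        picked = [features[v][p] for v, p in enumerate(choice)]
        # Score a combination by the sum of all pairwise similarities.
        score = sum(
            similarity(picked[i], picked[j])
            for i in range(len(picked))
            for j in range(i + 1, len(picked))
        )
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice


# Toy data: three videos, two proposals each, 2-D features.
toy_features = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[5.1, 5.0], [0.0, 9.0]],
    [[9.0, 0.0], [5.0, 4.9]],
]
print(select_consistent_proposals(toy_features))
```

The number of combinations grows exponentially with the number of videos, which is why a practical system would need an approximate or iterative solver rather than this enumeration.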
ClusterNet: Detecting Small Objects in Large Scenes by Exploiting Spatio-Temporal Information
Object detection in wide area motion imagery (WAMI) has drawn the attention
of the computer vision research community for a number of years. WAMI poses
a number of unique challenges including extremely small object sizes, both
sparse and densely-packed objects, and extremely large search spaces (large
video frames). Nearly all state-of-the-art methods in WAMI object detection
report that appearance-based classifiers fail in this challenging data and
instead rely almost entirely on motion information in the form of background
subtraction or frame-differencing. In this work, we experimentally verify the
failure of appearance-based classifiers in WAMI, such as Faster R-CNN and a
heatmap-based fully convolutional neural network (CNN), and propose a novel
two-stage spatio-temporal CNN which effectively and efficiently combines both
appearance and motion information to significantly surpass the state-of-the-art
in WAMI object detection. To reduce the large search space, the first stage
(ClusterNet) takes in a set of extremely large video frames, combines the
motion and appearance information within the convolutional architecture, and
proposes regions of objects of interest (ROOBI). These ROOBI can contain anything
from a single object to clusters of several hundred objects due to the large video frame size
and varying object density in WAMI. The second stage (FoveaNet) then estimates
the centroid location of all objects in that given ROOBI simultaneously via
heatmap estimation. The proposed method exceeds state-of-the-art results on the
WPAFB 2009 dataset by 5-16% for moving objects and nearly 50% for stopped
objects, as well as being the first proposed method in wide area motion imagery
to detect completely stationary objects.
Comment: Main paper is 8 pages. Supplemental section contains a walk-through
of our method (using a qualitative example) and qualitative results for WPAFB
2009 dataset.
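
As a rough illustration of how object centroids can be read off an estimated heatmap, the sketch below marks every cell that exceeds a threshold and is strictly larger than all of its neighbors. The abstract does not say how the second stage post-processes its heatmap, so the function name, the threshold value and the 8-neighborhood rule are all assumptions, not the published procedure.

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col) positions of local maxima in a 2-D heatmap.

    heatmap is a list of equal-length rows of floats. A cell counts as a
    peak if its value is at least `threshold` and strictly greater than
    every cell in its 8-neighborhood (assumed rule, for illustration).
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Collect the surrounding cells, clipped at the borders.
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbors):
                peaks.append((r, c))
    return peaks


# Toy heatmap with two well-separated responses.
toy_heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.2, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.3, 0.7],
    [0.0, 0.0, 0.0, 0.2, 0.3],
]
print(heatmap_peaks(toy_heatmap))  # [(1, 1), (2, 4)]
```

Because all peaks in a region are found in one pass over the heatmap, densely packed objects do not require one detection window each.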
Video Fill In the Blank using LR/RL LSTMs with Spatial-Temporal Attentions
Given a video and a description sentence with one missing word (we call it
the "source sentence"), the Video-Fill-In-the-Blank (VFIB) problem is to find the
missing word automatically. The contextual information of the sentence, as well
as visual cues from the video, are important to infer the missing word
accurately. Since the source sentence is broken into two fragments, the left
fragment (before the blank) and the right fragment (after the blank),
traditional Recurrent Neural Networks cannot encode this structure accurately,
given the many possible variations of the missing word in terms of its location
and type in the source sentence. For example,
a missing word can be the first word or be in the middle of the sentence and it
can be a verb or an adjective. In this paper, we propose a framework to tackle
the textual encoding: two separate LSTMs (the LR and RL LSTMs) are employed to
encode the left and right sentence fragments, and a novel structure is
introduced to combine each fragment with an "external memory" corresponding to the
opposite fragment. For the visual encoding, end-to-end spatial and temporal
attention models are employed to select discriminative visual representations
to find the missing word. In the experiments, we demonstrate the superior
performance of the proposed method on the challenging VFIB problem. Furthermore, we
introduce an extended and more generalized version of VFIB, which is not
limited to a single blank. Our experiments indicate the generalization
capability of our method in dealing with such more realistic scenarios.
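
The split of a source sentence into a left and a right fragment can be sketched as follows. The blank marker `_____`, the function name, and the choice to hand the right fragment over in reversed order (so that a right-to-left encoder would finish at the word next to the blank) are assumptions made for illustration; the abstract only states that the two fragments are encoded by separate LSTMs.

```python
def split_at_blank(sentence, blank="_____"):
    """Split a tokenized sentence into the fragments around the blank.

    Returns (left, right_reversed): the tokens before the blank in reading
    order, and the tokens after the blank in reversed order (assumed input
    order for a right-to-left encoder). Raises ValueError if no blank.
    """
    tokens = sentence.split()
    position = tokens.index(blank)
    left = tokens[:position]
    right_reversed = tokens[position + 1:][::-1]
    return left, right_reversed


left, right = split_at_blank("a man is _____ a guitar on stage")
print(left)   # ['a', 'man', 'is']
print(right)  # ['stage', 'on', 'guitar', 'a']
```

When the blank is the first word, the left fragment is simply empty, which is one of the positional variations the abstract mentions.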
Cross-View Image Matching for Geo-localization in Urban Environments
In this paper, we address the problem of cross-view image geo-localization.
Specifically, we aim to estimate the GPS location of a query street view image
by finding the matching images in a reference database of geo-tagged bird's eye
view images, or vice versa. To this end, we present a new framework for
cross-view image geo-localization by taking advantage of the tremendous success
of deep convolutional neural networks (CNNs) in image classification and object
detection. First, we employ the Faster R-CNN to detect buildings in the query
and reference images. Next, for each building in the query image, we retrieve
the nearest neighbors from the reference buildings using a Siamese network
trained on both positive matching image pairs and negative pairs. To find the
correct nearest neighbor (NN) for each query building, we develop an efficient multiple nearest
neighbors matching method based on dominant sets. We evaluate the proposed
framework on a new dataset that consists of pairs of street view and bird's eye
view images. Experimental results show that the proposed method achieves better
geo-localization accuracy than other approaches and is able to generalize to
images at unseen locations.
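
The nearest-neighbor retrieval step for a single query building can be sketched as a ranking of reference embeddings by cosine similarity. The embeddings are assumed to come from some learned network; the vectors, the function name and the use of cosine similarity are illustrative assumptions, and the dominant-set matching that picks the correct neighbor among the candidates is not shown.

```python
import math


def top_k_neighbors(query, references, k=2):
    """Return indices of the k reference vectors most similar to `query`.

    query is a list of floats; references is a list of such lists.
    Similarity is cosine similarity (assumed here, for illustration).
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(
        range(len(references)),
        key=lambda i: cosine(query, references[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 2-D embeddings for one query building and four reference buildings.
reference_embeddings = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]
print(top_k_neighbors([1.0, 0.0], reference_embeddings))  # [1, 2]
```

Keeping several candidates per query building, rather than only the top one, is what leaves room for a later joint matching step across all buildings in the image.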